Session 1: Welcome!
January 18, 2023
Data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge.
Wickham, H. & Grolemund, G. (2017). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media. Henceforth: R4DS. Freely available online (note: we will be using the in-progress 2nd edition).
Image credit: StackExchange Data Science user Stephan Kolassa, CC BY-SA 4.0, via Wikimedia Commons
‘Hal Varian, the chief economist at Google, is known to have said, “The sexy job in the next 10 years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s?” If “sexy” means having rare qualities that are much in demand, data scientists are already there. They are difficult and expensive to hire and, given the very competitive market for their services, difficult to retain. There simply aren’t a lot of people with their combination of scientific background and computational and analytical skills.’
In academic research, across a wide range of disciplines, we’re also interested in turning “raw data into understanding, insight, and knowledge”, as well as in communicating our results!
You will also develop an understanding of how these tools help to foster open science, reproducible research and thus the ethical treatment of data.
These skills are readily generalisable across a wide range of domains.
Apart from the fact that we’re using R?
palmerpenguins R package (Horst, 2020). Don’t worry, you will learn more about what an R package is later.

| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
| Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
| Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
| Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
| Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
| Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
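
To give a first impression of what such a table looks like inside R, here is a minimal sketch that types in only the six rows shown above and inspects them with base R. The object name `penguins_head` is made up for this illustration. The full data set in the palmerpenguins package has many more rows, so the numbers below describe these six rows only.

```r
# The six rows shown above, typed in by hand (NA marks a missing value).
# `penguins_head` is a hypothetical name chosen for this illustration.
penguins_head <- data.frame(
  species           = rep("Adelie", 6),
  island            = rep("Torgersen", 6),
  bill_length_mm    = c(39.1, 39.5, 40.3, NA, 36.7, 39.3),
  bill_depth_mm     = c(18.7, 17.4, 18.0, NA, 19.3, 20.6),
  flipper_length_mm = c(181, 186, 195, NA, 193, 190),
  body_mass_g       = c(3750, 3800, 3250, NA, 3450, 3650),
  sex               = c("male", "female", "female", NA, "female", "male"),
  year              = rep(2007, 6)
)

# Column types and a preview of the values
str(penguins_head)

# Number of missing values in each column
colSums(is.na(penguins_head))

# Mean body mass of these six rows, ignoring the row with missing values
mean(penguins_head$body_mass_g, na.rm = TRUE)
```

Note the fourth row: every measurement is `NA`, which is how R marks a missing value. Functions such as `mean()` return `NA` unless you tell them to skip missing values with `na.rm = TRUE`.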
Open this web app: https://ibsneuro.shinyapps.io/palmer_penguins/
Tab 1 contains information about the data set and lets you inspect the data frame
Tab 2 allows you to generate plots by selecting the type of graph, which variables to put on the x and y axes and which variable to group by (using different colours)
In your exploration, consider the questions on the following slide
For each question, note down not only your answer but also the strategy you chose to get to it: how did you choose to construct your graph for the question and why?
If you wanted to predict a penguin’s body mass, which other attributes could you look at (e.g. flipper length, bill length, sex etc.)? In other words, which of the other attributes appear to be most predictive of body mass?
Is there a close relationship between bill length and bill depth?
Is it possible to look at effects of island (i.e. the environment in which the penguins live) independently of other factors such as species or sex? If not, why not?
Explore another 2 or 3 questions that interest you
Finally, reflect on what this exercise has shown you regarding the use of different graph types to address different questions
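
Once you have explored these questions in the web app, you may be curious how similar checks could look in R itself. The following is only a rough sketch, assuming the palmerpenguins package is installed and that its `penguins` data frame has the columns shown in the table earlier. It uses base R functions rather than the tools we will cover later in the course, and it is one possible approach, not the way the web app works.

```r
# Assumes install.packages("palmerpenguins") has been run beforehand
library(palmerpenguins)

# Island vs. species: how many penguins of each species were recorded per island?
table(penguins$species, penguins$island)

# Bill length against bill depth, one colour per species
species_codes <- as.integer(penguins$species)
plot(
  penguins$bill_length_mm, penguins$bill_depth_mm,
  col  = species_codes,
  xlab = "Bill length (mm)",
  ylab = "Bill depth (mm)"
)
legend(
  "bottomright",
  legend = levels(penguins$species),
  col    = seq_along(levels(penguins$species)),
  pch    = 1
)

# Correlation between bill length and bill depth, dropping incomplete rows:
# first across all penguins, then separately within each species
cor(penguins$bill_length_mm, penguins$bill_depth_mm, use = "complete.obs")
by(penguins, penguins$species, function(d) {
  cor(d$bill_length_mm, d$bill_depth_mm, use = "complete.obs")
})
```

Compare what the overall correlation and the per-species correlations suggest, and check whether that matches the strategy you noted down for the bill length and bill depth question.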
For each of the following challenges, go back to the raw document (i.e. the one that doesn’t look pretty 😄), try to figure out how to make the relevant change and then render the document using Knit to see whether you were correct!